{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "# SYCL Program Structure"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Sections\n",
    "- [What is SYCL?](#What-is-SYCL?)\n",
    "- [SYCL Classes](#SYCL-Classes)\n",
    "  - [Device](#Device)\n",
    "  - _Code:_ [Device Selector](#Device-Selector)\n",
    "  - [Queue](#Queue)\n",
    "  - [Kernel](#Kernel)\n",
    "- [Parallel Kernels](#Parallel-Kernels)\n",
    "  - [Basic Parallel Kernels](#Basic-Parallel-Kernels)\n",
    "  - [ND-Range Kernels](#ND-Range-Kernels)\n",
    "- [Memory Models](#Memory-Models)\n",
    "  - _Code:_ [Vector Add implementation using USM and Buffers](#Vector-Add-implementation-using-USM-and-Buffers)\n",
    "  - [Unified Shared Memory Model](#Unified-Shared-Memory-Model)\n",
    "  - [Buffer Memory Model](#Buffer-Memory-Model)\n",
    "    - _Code:_ [Synchronization: Host Accessor](#Synchronization:-Host-Accessor)\n",
    "    - _Code:_ [Synchronization: Buffer Destruction](#Synchronization:-Buffer-Destruction)\n",
    "- _Code:_ [Custom Device Selector](#Custom-Device-Selector)\n",
    "- [Multi-GPU Selection](#Multi-GPU-Selection)\n",
    "- _Code:_ [Complex Number Multiplication](#Code-Sample:-Complex-Number-Multiplication)\n",
    "- _Lab Exercise:_ [Vector Add](#Lab-Exercise:-Vector-Add)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Learning Objectives\n",
    "* Explain the __SYCL__ fundamental classes\n",
    "* Use __device selection__ to offload kernel workloads\n",
    "* Decide when to use __basic parallel kernels__ and __ND Range Kernels__\n",
    "* Use __Unified Shared Memory__ or __Buffer-Accessor__ memory model in SYCL program\n",
    "* Build a sample __SYCL application__ through hands-on lab exercises"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true,
    "tags": []
   },
   "source": [
    "## What is SYCL?\n",
    "_SYCL__ is an open standard to program for heterogeneous devicee in a single source. A SYCL program is invoked on the host computer and offloads the computation to an accelerator. Programmers use familiar C++ and library constructs with added functionalities like a __queue__ for work targeting, __buffer__ or __Unified Shared Memory__ for data management, and __parallel_for__ for parallelism to direct which parts of the computation and data should be offloaded."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## SYCL Language and Runtime\n",
    "SYCL language and runtime consists of a set of C++ classes, templates, and libraries.\n",
    "\n",
    " __Application scope__ and __command group scope__:\n",
    " * Code that executes on the host\n",
    " * The full capabilities of C++ are available at application and command group scope \n",
    "\n",
    "__Kernel__ scope:\n",
    " * Code that executes on the device. \n",
    " * At __kernel__ scope there are __limitations__ in accepted C++\n",
    "\n",
    "\n",
    "\n",
    "#### C++ SYCL Code Example\n",
    "Let's look at a simple SYCL code to offload computation to GPU, the code does the following:\n",
    "1. selects GPU device for offload\n",
    "2. allocates memory that can be accessed on host and GPU\n",
    "3. initializes data array on host\n",
    "4. offloads computation to GPU\n",
    "5. prints output on host\n",
    "\n",
    "\n",
    "```cpp\n",
    "#include <sycl/sycl.hpp>\n",
    "static const int N = 16;\n",
    "int main(){\n",
    "  sycl::queue q(sycl::gpu_device_selector_v); // <--- select GPU for offload \n",
    "\n",
    "  int *data = sycl::malloc_shared<int>(N, q); // <--- allocate memory\n",
    "\n",
    "  for(int i=0; i<N; i++) data[i] = i;\n",
    "\n",
    "  q.parallel_for(N, [=] (auto i){\n",
    "    data[i] *= 2;  // <--- Kernel Code (executes on GPU)\n",
    "  }).wait();\n",
    "\n",
    "  for(int i=0; i<N; i++) std::cout << data[i] << \"\\n\";\n",
    "\n",
    "  sycl::free(data, q);\n",
    "  return 0;\n",
    "}\n",
    "```\n",
    "\n",
    "In the next few sections we will learn the basics of C++ SYCL programming."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## SYCL Classes\n",
    "\n",
    "Below are some important SYCL Classes that are used to write a C++ with SYCL program to offload computation to heterogeneous devices."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Device\n",
    "The __device__ class represents the capabilities of the accelerators in a system utilizing Intel&reg; oneAPI Toolkits. The device class contains member functions for querying information about the device, which is useful for SYCL programs where multiple devices are created.\n",
    "* The function __get_info__ gives information about the device:\n",
    " * Name, vendor, and version of the device\n",
    " * The local and global work item IDs\n",
    " * Width for built in types, clock frequency, cache width and sizes, online or offline\n",
    " \n",
    "```cpp\n",
    "queue q;\n",
    "device my_device = q.get_device();\n",
    "std::cout << \"Device: \" << my_device.get_info<info::device::name>() << \"\\n\";\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Device Selector\n",
    "These classes enable the runtime selection of a particular device to execute kernels based upon user-provided heuristics. The following code sample shows use of the standard device selectors (__default_selector_v, cpu_selector_v, gpu_selector_v, accelerator_selector_v__)\n",
    "\n",
    " \n",
    "```cpp\n",
    "queue q(gpu_selector_v);\n",
    "//queue q(cpu_selector_v);\n",
    "//queue q(accelerator_selector_v);\n",
    "//queue q(default_selector_v);\n",
    "//queue q;\n",
    "\n",
    "std::cout << \"Device: \" << q.get_device().get_info<info::device::name>() << \"\\n\";\n",
    "```\n",
    "\n",
    "The SYCL code below shows different device selectors: Inspect code, there are no modifications necessary:\n",
    "1. Inspect the code cell below and click run ▶ to save the code to file\n",
    "2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile lab/gpu_sample.cpp\n",
    "//==============================================================\n",
    "// Copyright © Intel Corporation\n",
    "//\n",
    "// SPDX-License-Identifier: MIT\n",
    "// =============================================================\n",
    "#include <sycl/sycl.hpp>\n",
    "\n",
    "using namespace sycl;\n",
    "\n",
    "int main() {\n",
    "  //# Create a device queue with device selector\n",
    "  \n",
    "  queue q(gpu_selector_v);\n",
    "  //queue q(cpu_selector_v);\n",
    "  //queue q(accelerator_selector_v);\n",
    "  //queue q(default_selector_v);\n",
    "  //queue q;\n",
    "\n",
    "  //# Print the device name\n",
    "  std::cout << \"Device: \" << q.get_device().get_info<info::device::name>() << \"\\n\";\n",
    "\n",
    "  return 0;\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Build and Run\n",
    "Select the cell below and click run ▶ to compile and execute the code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! chmod 755 q; chmod 755 run_gpu.sh;if [ -x \"$(command -v qsub)\" ]; then ./q run_gpu.sh; else ./run_gpu.sh; fi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Queue\n",
    "__Queue__ submits command groups to be executed by the SYCL runtime. Queue is a mechanism where __work is submitted__ to a device.A queue map to one device and multiple queues can be mapped to the same device.\n",
    " \n",
    "```cpp\n",
    "q.submit([&](handler& h) {\n",
    "    //COMMAND GROUP CODE\n",
    "});\n",
    "```\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Kernel\n",
    "The __kernel__ class encapsulates methods and data for executing code on the device when a command group is instantiated. Kernel object is not explicitly constructed by the user and is constructed when a kernel dispatch function, such as __parallel_for__, is called \n",
    " ```cpp\n",
    " q.submit([&](handler& h) {\n",
    "  h.parallel_for(range<1>(N), [=](id<1> i) {\n",
    "    A[i] = B[i] + C[i]);\n",
    "  });\n",
    "});\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Choosing where device kernels run\n",
    "\n",
    "Work is submitted to queues and each queue is associated with exactly one device (e.g. a specific GPU or FPGA). You can decide which device a queue is associated with (if you want) and have as many queues as desired for dispatching work in heterogeneous systems.        \n",
    "\n",
    "|Target Device |Queue|\n",
    "|-----|-------|\n",
    "|Create queue targeting any device: | queue() |\n",
    "| Create queue targeting a pre-configured classes of devices: | queue(cpu_selector_v); queue(gpu_selector_v); queue(accelerator_selector_v); queue(default_selector_v);|\n",
    "|Create queue targeting specific device (custom criteria): | queue(custom_selector); |                    \n",
    "                                                          \n",
    "                                                               \n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "        \n",
    "<img src=\"Assets/queue.png\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Parallel Kernels\n",
    "\n",
    "__Parallel Kernel__ allows multiple instances of an operation to execute in parallel. This is useful to __offload__ parallel execution of a basic __for-loop__ in which each iteration is completely independent and in any order. Parallel kernels are expressed using the __parallel_for__ function\n",
    "A simple 'for' loop in a C++ application is written as below\n",
    "\n",
    "```cpp\n",
    "for(int i=0; i < 1024; i++){\n",
    "    a[i] = b[i] + c[i];\n",
    "});\n",
    "```\n",
    "\n",
    "Below is how you can offload to accelerator\n",
    "\n",
    "```cpp\n",
    "q.parallel_for(range<1>(1024), [=](id<1> i){\n",
    "    A[i] =  B[i] + C[i];\n",
    "});\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Basic Parallel Kernels\n",
    "\n",
    "The functionality of basic parallel kernels is exposed via __range__, __id__, and __item__ classes. __Range__ class is used to describe the __iteration space__ of parallel execution and __id__ class is used to __index__ an individual instance of a kernel in a parallel execution\n",
    "\n",
    "\n",
    "```cpp\n",
    "q.parallel_for(range<1>(1024), [=](id<1> i){\n",
    "// CODE THAT RUNS ON DEVICE \n",
    "});\n",
    "\n",
    "```\n",
    "The above example is good if all you need is the __index (id)__, but if you need the __range__ value in your kernel code, then you can use __item__ class instead of __id__ class, which you can use to query for the __range__ as shown below.  __item__ class represents an __individual instance__ of a kernel function, exposes additional functions to query properties of the execution range\n",
    "\n",
    "\n",
    "```cpp\n",
    "q.parallel_for(range<1>(1024), [=](item<1> item){\n",
    "    auto i = item.get_id();\n",
    "    auto R = item.get_range();\n",
    "    // CODE THAT RUNS ON DEVICE\n",
    "    \n",
    "    \n",
    "});\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### ND-Range Kernels\n",
    "Basic Parallel Kernels are easy way to parallelize a for-loop but does not allow performance optimization at hardware level. __ND-Range kernel__ is another way to expresses parallelism which enable low level performance tuning by providing access to __local memory and mapping executions__ to compute units on hardware. The entire iteration space is divided into smaller groups called __work-groups__, __work-items__ within a work-group are scheduled on a single compute unit on hardware.\n",
    "\n",
    "The grouping of kernel executions into work-groups  will allow control of resource usage and load balance work distribution.The functionality of nd_range kernels is exposed via __nd_range__ and __nd_item__ classes. __nd_range__ class represents a __grouped execution range__ using global execution range and the local execution range of each work-group. __nd_item__ class  represents an __individual instance__ of a kernel function and allows to query for work-group range and index.\n",
    "\n",
    "```cpp\n",
    "q.parallel_for(nd_range<1>(range<1>(1024),range<1>(64)), [=](nd_item<1> item){\n",
    "    auto idx = item.get_global_id();\n",
    "    auto local_id = item.get_local_id();\n",
    "    // CODE THAT RUNS ON DEVICE\n",
    "});\n",
    "```\n",
    "<img src=\"Assets/ndrange.png\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Memory Models\n",
    "\n",
    "A SYCL application can be written using one of the 2 memory models:\n",
    "- Unified Shared Memory Model (USM)\n",
    "- Buffer Memory Model\n",
    "\n",
    "__Unified Shared Memory__ Model is pointer-based approach to memory model, similar to C/C++ pointer-based memory allocation. Makes migrating C/C++/CUDA* application to SYCL easier. Dependencies between multiple kernels are explicitly handled using events.\n",
    "\n",
    "__Buffer Memory Model__ allows a new memory abstraction called buffers and are accessed using accessors which allows setting read/write permissions and other properties to memory. Allows data representation in 1,2 or 3-dimentions and makes programming kernels with 2/3-dimentional data easier. Dependencies between multiple kernels are implicitly handled.\n",
    "\n",
    "\n",
    "#### Vector Add implementation using USM and Buffers\n",
    "The SYCL code below shows vector add computation implemented using USM and Buffers memory model: Inspect code, there are no modifications necessary:\n",
    "1. Inspect the code cell below and click run ▶ to save the code to file\n",
    "2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile lab/vector_add_usm_buffers.cpp\n",
    "//==============================================================\n",
    "// Copyright © Intel Corporation\n",
    "//\n",
    "// SPDX-License-Identifier: MIT\n",
    "// =============================================================\n",
    "#include <sycl/sycl.hpp>\n",
    "\n",
    "using namespace sycl;\n",
    "\n",
    "// kernel function to compute vector add using Unified Shared memory model (USM)\n",
    "void kernel_usm(int* a, int* b, int* c, int N) {\n",
    "  //Step 1: create a device queue\n",
    "  queue q;\n",
    "  //Step 2: create USM device allocation\n",
    "  auto a_device = malloc_device<int>(N, q); \n",
    "  auto b_device = malloc_device<int>(N, q); \n",
    "  auto c_device = malloc_device<int>(N, q); \n",
    "  //Step 3: copy memory from host to device\n",
    "  q.memcpy(a_device, a, N*sizeof(int));\n",
    "  q.memcpy(b_device, b, N*sizeof(int));\n",
    "  q.wait();\n",
    "  //Step 4: send a kernel (lambda) for execution\n",
    "  q.parallel_for(N, [=](auto i){\n",
    "    //Step 5: write a kernel\n",
    "    c_device[i] = a_device[i] + b_device[i];\n",
    "  }).wait();\n",
    "  //Step 6: copy the result back to host\n",
    "  q.memcpy(c, c_device, N*sizeof(int)).wait();\n",
    "  //Step 7: free device allocation\n",
    "  free(a_device, q);\n",
    "  free(b_device, q);\n",
    "  free(c_device, q);\n",
    "}\n",
    "\n",
    "// kernel function to compute vector add using Buffer memory model\n",
    "void kernel_buffers(int* a, int* b, int* c, int N) {\n",
    "  //Step 1: create a device queue\n",
    "  queue q;\n",
    "  //Step 2: create buffers \n",
    "  buffer buf_a(a, range<1>(N));\n",
    "  buffer buf_b(b, range<1>(N));\n",
    "  buffer buf_c(c, range<1>(N));\n",
    "  //Step 3: submit a command for (asynchronous) execution\n",
    "  q.submit([&](handler &h){\n",
    "    //Step 4: create buffer accessors to access buffer data on the device\n",
    "    accessor A(buf_a, h, read_only);\n",
    "    accessor B(buf_b, h, read_only);\n",
    "    accessor C(buf_c, h, write_only);\n",
    "    //Step 5: send a kernel (lambda) for execution\n",
    "    h.parallel_for(N, [=](auto i){\n",
    "      //Step 6: write a kernel\n",
    "      C[i] = A[i] + B[i];\n",
    "    });\n",
    "  });\n",
    "}\n",
    "\n",
    "int main() {\n",
    "  // initialize data arrays on host\n",
    "  constexpr int N = 256;\n",
    "  int a[N], b[N], c[N];\n",
    "  for (int i=0; i<N; i++){\n",
    "    a[i] = 1;\n",
    "    b[i] = 2;\n",
    "  }\n",
    "    \n",
    "  // initialize c = 0 and offload computation using USM, print output \n",
    "  for (int i=0; i<N; i++) c[i] = 0;\n",
    "  kernel_usm(a, b, c, N);\n",
    "  std::cout << \"Vector Add Output (USM): \\n\";\n",
    "  for (int i=0; i<N; i++)std::cout << c[i] << \" \";std::cout << \"\\n\";\n",
    "\n",
    "  // initialize c = 0 and offload computation using USM, print output \n",
    "  for (int i=0; i<N; i++) c[i] = 0;\n",
    "  std::cout << \"Vector Add Output (Buffers): \\n\";\n",
    "  kernel_buffers(a, b, c, N);\n",
    "  for (int i=0; i<N; i++)std::cout << c[i] << \" \";std::cout << \"\\n\";\n",
    "    \n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Build and Run\n",
    "Select the cell below and click run ▶ to compile and execute the code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! chmod 755 q; chmod 755 run_vector_add_usm_buffers.sh;if [ -x \"$(command -v qsub)\" ]; then ./q run_vector_add_usm_buffers.sh; else ./run_vector_add_usm_buffers.sh; fi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Unified Shared Memory Model\n",
    "\n",
    "USM is a __pointer-based approach__ that should be familiar to C and C++ programmers who use malloc or new to allocate data. USM __simplifies development__ for the programmer when __porting existing C/C++/CUDA code__ to SYCL.\n",
    "\n",
    "\n",
    "#### SYCL Code Anotomy - USM\n",
    "\n",
    "Programs which utilize oneAPI require the include of __sycl/sycl.hpp__. It is recommended to employ the namespace statement to save typing repeated references into the sycl namespace.\n",
    "\n",
    "```cpp\n",
    "#include <sycl/sycl.hpp>\n",
    "using namespace sycl;\n",
    "```\n",
    "\n",
    "__SYCL programs__ are standard C++. The program is invoked on the __host__ computer, and offloads computation to the __accelerator__. A programmer uses SYCL’s __queue and kernel abstractions__ to direct which parts of the computation and data should be offloaded.\n",
    "\n",
    "As a first step in a SYCL program we create a __queue__. We offload computation to a __device__ by submitting tasks to a queue. The programmer can choose CPU, GPU, FPGA, and other devices through the __selector__. This program uses the default  q here, which means SYCL runtime selects the most capable device available at runtime by using the default selector.\n",
    "\n",
    "Device and host can either share physical __memory__ or have distinct memories. When the memories are distinct, offloading computation requires __copying data between host and device__. We use USM device allocation `malloc_device` to allocate memory on device and copy data between host and device using `memcpy` method.\n",
    "\n",
    "In a SYCL program, we define a __kernel__, which is applied to every point in an index space. For simple programs like this one, the index space maps directly to the elements of the array. The kernel is encapsulated in a __C++ lambda function__. The lambda function is passed a point in the index space as an array of coordinates. For this simple program, the index space coordinate is the same as the array index. The __parallel_for__ in the below program applies the lambda to the index space. The index space is defined in the first argument of the parallel_for as a 1 dimensional __range from 0 to N-1__.\n",
    "\n",
    "The code below shows Simple Vector addition using SYCL. Read through the comments addressed in step 1 through step 6.\n",
    "\n",
    "```cpp\n",
    "void SYCL_code(int* a, int* b, int* c, int N) {\n",
    "  //Step 1: create a device queue\n",
    "  //(developer can specify a device type via device selector or use default selector)\n",
    "  queue q;\n",
    "  //Step 2: create USM device allocation\n",
    "  auto a_device = malloc_device<int>(N, q); \n",
    "  auto b_device = malloc_device<int>(N, q); \n",
    "  auto c_device = malloc_device<int>(N, q); \n",
    "  //Step 3: copy memory from host to device\n",
    "  q.memcpy(a_device, a, N*sizeof(int));\n",
    "  q.memcpy(b_device, b, N*sizeof(int));\n",
    "  q.wait();\n",
    "  //Step 4: send a kernel (lambda) for execution\n",
    "  q.parallel_for(N, [=](auto i){\n",
    "    //Step 5: write a kernel\n",
    "    //Kernel invocations are executed in parallel\n",
    "    //Kernel is invoked for each element of the range\n",
    "    //Kernel invocation has access to the invocation id\n",
    "    c_device[i] = a_device[i] + b_device[i];\n",
    "  }).wait();\n",
    "  //Step 6: copy the result back to host\n",
    "  q.memcpy(c, c_device, N*sizeof(int)).wait();\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Buffer Memory Model\n",
    "__Buffers encapsulate__ data in a SYCL application across both devices and host. __Accessors__ is the mechanism to access buffer data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### SYCL Code Anatomy - Buffer Model\n",
    "Programs which utilize oneAPI require the include of __sycl/sycl.hpp__. It is recommended to employ the namespace statement to save typing repeated references into the sycl namespace.\n",
    "\n",
    "```cpp\n",
    "#include <sycl/sycl.hpp>\n",
    "using namespace sycl;\n",
    "```\n",
    "\n",
    "__SYCL programs__ are standard C++. The program is invoked on the __host__ computer, and offloads computation to the __accelerator__. A programmer uses SYCL’s __queue, buffer, device, and kernel abstractions__ to direct which parts of the computation and data should be offloaded.\n",
    "\n",
    "As a first step in a SYCL program we create a __queue__. We offload computation to a __device__ by submitting tasks to a queue. The programmer can choose CPU, GPU, FPGA, and other devices through the __selector__. This program uses the default  q here, which means SYCL runtime selects the most capable device available at runtime by using the default selector. We will talk about the devices, device selectors, and the concepts of buffers, accessors and kernels in the upcoming modules but below is a simple SYCL program for you to get started with the above concepts.\n",
    "\n",
    "Device and host can either share physical __memory__ or have distinct memories. When the memories are distinct, offloading computation requires __copying data between host and device__. SYCL does not require the programmer to manage the data copies. By creating __Buffers and Accessors__, SYCL ensures that the data is available to host and device without any programmer effort. SYCL also allows the programmer explicit control over data movement when it is necessary to achieve best peformance.\n",
    "\n",
    "In a SYCL program, we define a __kernel__, which is applied to every point in an index space. For simple programs like this one, the index space maps directly to the elements of the array. The kernel is encapsulated in a __C++ lambda function__. The lambda function is passed a point in the index space as an array of coordinates. For this simple program, the index space coordinate is the same as the array index. The __parallel_for__ in the below program applies the lambda to the index space. The index space is defined in the first argument of the parallel_for as a 1 dimensional __range from 0 to N-1__.\n",
    "\n",
    "\n",
    "The code below shows Simple Vector addition using SYCL. Read through the comments addressed in step 1 through step 6.\n",
    "\n",
    "```cpp\n",
    "void SYCL_code(int* a, int* b, int* c, int N) {\n",
    "  //Step 1: create a device queue\n",
    "  //(developer can specify a device type via device selector or use default selector)\n",
    "  queue q;\n",
    "  //Step 2: create buffers (represent both host and device memory)\n",
    "  buffer buf_a(a, range<1>(N));\n",
    "  buffer buf_b(b, range<1>(N));\n",
    "  buffer buf_c(c, range<1>(N));\n",
    "  //Step 3: submit a command for (asynchronous) execution\n",
    "  q.submit([&](handler &h){\n",
    "    //Step 4: create buffer accessors to access buffer data on the device\n",
    "    accessor A(buf_a,h,read_only);\n",
    "    accessor B(buf_b,h,read_only);\n",
    "    accessor C(buf_c,h,write_only);\n",
    "  \n",
    "    //Step 5: send a kernel (lambda) for execution\n",
    "    h.parallel_for(N, [=](auto i){\n",
    "      //Step 6: write a kernel\n",
    "      //Kernel invocations are executed in parallel\n",
    "      //Kernel is invoked for each element of the range\n",
    "      //Kernel invocation has access to the invocation id\n",
    "      C[i] = A[i] + B[i];\n",
    "    });\n",
    "  });\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Synchronization: Host Accessor\n",
    "\n",
    "The Host Accessor is an accessor which uses host buffer access target. It is created outside of the scope of the command group and the data that this gives access to will be available on the host. These are used to synchronize the data back to the host by constructing the host accessor objects. Buffer destruction is the other way to synchronize the data back to the host.\n",
    "\n",
    "Buffer takes ownership of the data stored in vector. Creating host accessor is a __blocking call__ and will only return after all enqueued SYCL kernels that modify the same buffer in any queue completes execution and the data is available to the host via this host accessor.\n",
    "\n",
    "The SYCL code below demonstrates Synchronization with Host Accessor: Inspect code, there are no modifications necessary:\n",
    "\n",
    "\n",
    "1. Inspect the code cell below and click run ▶ to save the code to file\n",
    "\n",
    "2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile lab/host_accessor_sample.cpp\n",
    "//==============================================================\n",
    "// Copyright © Intel Corporation\n",
    "//\n",
    "// SPDX-License-Identifier: MIT\n",
    "// =============================================================\n",
    "\n",
    "#include <sycl/sycl.hpp>\n",
    "using namespace sycl;\n",
    "\n",
    "int main() {\n",
    "  constexpr int N = 16;\n",
    "  auto R = range<1>(N);\n",
    "  std::vector<int> v(N, 10);\n",
    "  queue q;\n",
    "  // Buffer takes ownership of the data stored in vector.  \n",
    "  buffer buf(v);\n",
    "  q.submit([&](handler& h) {\n",
    "    accessor a(buf,h);\n",
    "    h.parallel_for(R, [=](auto i) { a[i] -= 2; });\n",
    "  });\n",
    "  // Creating host accessor is a blocking call and will only return after all\n",
    "  // enqueued SYCL kernels that modify the same buffer in any queue completes\n",
    "  // execution and the data is available to the host via this host accessor.\n",
    "  host_accessor b(buf,read_only);\n",
    "  for (int i = 0; i < N; i++) std::cout << b[i] << \" \";\n",
    "  return 0;\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Build and Run\n",
    "Select the cell below and click run ▶ to compile and execute the code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! chmod 755 q; chmod 755 run_host_accessor.sh;if [ -x \"$(command -v qsub)\" ]; then ./q run_host_accessor.sh; else ./run_host_accessor.sh; fi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Synchronization: Buffer Destruction\n",
    "In the below example Buffer creation happens within a separate function scope. When execution advances beyond this __function scope__, buffer destructor is invoked which relinquishes the ownership of data and copies back the data to the host memory.\n",
    "\n",
    "The SYCL code below demonstrates Synchronization with Buffer Destruction: Inspect code, there are no modifications necessary:\n",
    "\n",
    "1. Inspect the code cell below and click run ▶ to save the code to a file.\n",
    "\n",
    "2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile lab/buffer_destruction2.cpp\n",
    "//==============================================================\n",
    "// Copyright © Intel Corporation\n",
    "//\n",
    "// SPDX-License-Identifier: MIT\n",
    "// =============================================================\n",
    "\n",
    "#include <sycl/sycl.hpp>\n",
    "constexpr int N = 16;\n",
    "using namespace sycl;\n",
    "\n",
    "// Buffer creation happens within a separate function scope.\n",
    "void SYCL_code(std::vector<int> &v, queue &q) {\n",
    "  auto R = range<1>(N);\n",
    "  buffer buf(v);\n",
    "  q.submit([&](handler &h) {\n",
    "    accessor a(buf,h);\n",
    "    h.parallel_for(R, [=](auto i) { a[i] -= 2; });\n",
    "  });\n",
    "}\n",
    "int main() {\n",
    "  std::vector<int> v(N, 10);\n",
    "  queue q;\n",
    "  SYCL_code(v, q);\n",
    "  // When execution advances beyond this function scope, buffer destructor is\n",
    "  // invoked which relinquishes the ownership of data and copies back the data to\n",
    "  // the host memory.\n",
    "  for (int i = 0; i < N; i++) std::cout << v[i] << \" \";\n",
    "  return 0;\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Build and Run\n",
    "Select the cell below and click run ▶ to compile and execute the code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! chmod 755 q; chmod 755 run_buffer_destruction.sh;if [ -x \"$(command -v qsub)\" ]; then ./q run_buffer_destruction.sh; else ./run_buffer_destruction.sh; fi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Custom Device Selector\n",
    "The following code shows custom device selector using your own logic. The selected device prioritizes a GPU device because the integer rating returned is higher than for CPU or other accelerator. \n",
    "\n",
    "\n",
    "#### Example of custom device selector with specific vendor name\n",
    "```cpp\n",
    "// Return 1 if the vendor name is \"Intel\", 0 otherwise.\n",
    "// A rating of 0 does not prevent another device from being picked as a second choice.\n",
    "int custom_device_selector(const sycl::device& d) {\n",
    "  return d.get_info<sycl::info::device::vendor>() == \"Intel\";\n",
    "}\n",
    "\n",
    "sycl::device preferred_device { custom_device_selector };\n",
    "sycl::queue q(preferred_device);\n",
    "```\n",
    "\n",
    "#### Example of custom device selector with specific GPU device name\n",
    "```cpp\n",
    "// Return 1 if the device is a GPU and its name contains \"Intel\"\n",
    "int custom_device_selector(const sycl::device& d) {\n",
    "  return d.is_gpu() && (d.get_info<sycl::info::device::name>().find(\"Intel\") != std::string::npos);\n",
    "}\n",
    "\n",
    "sycl::device preferred_device { custom_device_selector };\n",
    "sycl::queue q(preferred_device);\n",
    "```\n",
    "\n",
    "#### Example of custom device selector with priority based on device\n",
    "```cpp\n",
    "// Highest priority for a Xeon device, then any GPU, then any CPU.\n",
    "int custom_device_selector(const sycl::device& d) {\n",
    "  int rating = 0;\n",
    "  if (d.get_info<sycl::info::device::name>().find(\"Xeon\") != std::string::npos) rating = 3;\n",
    "  else if (d.is_gpu()) rating = 2;\n",
    "  else if (d.is_cpu()) rating = 1;\n",
    "  return rating;\n",
    "}\n",
    "\n",
    "sycl::device preferred_device { custom_device_selector };\n",
    "sycl::queue q(preferred_device);\n",
    "```\n",
    "\n",
    "\n",
    "The SYCL code below demonstrates a custom device selector. Inspect the code; no modifications are necessary:\n",
    "\n",
    "1. Inspect the code cell below and click run ▶ to save the code to a file.\n",
    "\n",
    "2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile lab/custom_device_sample.cpp\n",
    "//==============================================================\n",
    "// Copyright © Intel Corporation\n",
    "//\n",
    "// SPDX-License-Identifier: MIT\n",
    "// =============================================================\n",
    "#include <sycl/sycl.hpp>\n",
    "#include <iostream>\n",
    "using namespace sycl;\n",
    "class my_device_selector {\n",
    "public:\n",
    "    my_device_selector(std::string vendorName) : vendorName_(vendorName){};\n",
    "    int operator()(const device& dev) const {\n",
    "    int rating = 0;\n",
    "    //Give the highest rating (3) to a GPU device whose name contains the\n",
    "    //requested vendor name, the second preference (2) to any other GPU device\n",
    "    //and the third preference (1) to a CPU device.\n",
    "    if (dev.is_gpu() && (dev.get_info<info::device::name>().find(vendorName_) != std::string::npos))\n",
    "        rating = 3;\n",
    "    else if (dev.is_gpu()) rating = 2;\n",
    "    else if (dev.is_cpu()) rating = 1;\n",
    "    return rating;\n",
    "    };\n",
    "    \n",
    "private:\n",
    "    std::string vendorName_;\n",
    "};\n",
    "int main() {\n",
    "    //pass in the name of the vendor for which the device you want to query \n",
    "    std::string vendor_name = \"Intel\";\n",
    "    //std::string vendor_name = \"AMD\";\n",
    "    //std::string vendor_name = \"Nvidia\";\n",
    "    my_device_selector selector(vendor_name);\n",
    "    queue q(selector);\n",
    "    std::cout << \"Device: \"\n",
    "    << q.get_device().get_info<info::device::name>() << \"\\n\";\n",
    "    return 0;\n",
    "}\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Build and Run\n",
    "Select the cell below and click run ▶ to compile and execute the code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! chmod 755 q; chmod 755 run_custom_device.sh;if [ -x \"$(command -v qsub)\" ]; then ./q run_custom_device.sh; else ./run_custom_device.sh; fi"
   ]
  },
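  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The selection mechanism can be illustrated without SYCL: given a rating function, the runtime conceptually picks the device with the highest score. Below is a minimal plain-C++ sketch using a hypothetical `MockDevice` type (not a SYCL type) to show how ranking selects a device:\n",
    "\n",
    "```cpp\n",
    "#include <algorithm>\n",
    "#include <iostream>\n",
    "#include <string>\n",
    "#include <vector>\n",
    "\n",
    "// Hypothetical mock, not part of SYCL: stands in for sycl::device so the\n",
    "// ranking logic can be shown without a SYCL runtime.\n",
    "struct MockDevice {\n",
    "  std::string name;\n",
    "  bool gpu;\n",
    "  bool cpu;\n",
    "};\n",
    "\n",
    "// Same shape as the custom selectors above: higher rating wins.\n",
    "int rate(const MockDevice &d) {\n",
    "  if (d.name.find(\"Xeon\") != std::string::npos) return 3;\n",
    "  if (d.gpu) return 2;\n",
    "  if (d.cpu) return 1;\n",
    "  return 0;\n",
    "}\n",
    "\n",
    "int main() {\n",
    "  std::vector<MockDevice> devices = {\n",
    "      {\"Generic GPU\", true, false},\n",
    "      {\"Intel Xeon CPU\", false, true},\n",
    "      {\"Other CPU\", false, true},\n",
    "  };\n",
    "  // The runtime conceptually does the equivalent of picking the max rating.\n",
    "  auto best = std::max_element(devices.begin(), devices.end(),\n",
    "                               [](const MockDevice &a, const MockDevice &b) {\n",
    "                                 return rate(a) < rate(b);\n",
    "                               });\n",
    "  std::cout << best->name << \"\\n\";  // prints Intel Xeon CPU\n",
    "  return 0;\n",
    "}\n",
    "```"
   ]
  },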
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Multi-GPU Selection\n",
    "\n",
    "To submit a job to a single GPU, we use the `sycl::device` class with `sycl::gpu_selector_v` to find a GPU device on the system and then create a `sycl::queue` with this device, as shown below:\n",
    "\n",
    "```cpp\n",
    "auto gpu = sycl::device(sycl::gpu_selector_v);\n",
    "\n",
    "sycl::queue q(gpu);\n",
    "```\n",
    "\n",
    "To find multiple GPU devices in the system, the `sycl::platform` class is used to query the devices in a platform: `sycl::gpu_selector_v` filters for GPU devices, and the `get_devices()` method returns a vector of the GPU devices found.\n",
    "\n",
    "```cpp\n",
    "auto gpus = sycl::platform(sycl::gpu_selector_v).get_devices();\n",
    "\n",
    "sycl::queue q_gpu1(gpus[0]);\n",
    "sycl::queue q_gpu2(gpus[1]);\n",
    "```\n",
    "\n",
    "Once we have found all the GPU devices, we create a `sycl::queue` for each device and submit kernels to them.\n",
    "\n",
    "The code below shows how to find multiple GPU devices on a system and submit different kernels to different GPU devices:\n",
    "\n",
    "```cpp\n",
    "  // Get all GPU devices in the platform\n",
    "  auto gpus = sycl::platform(sycl::gpu_selector_v).get_devices();\n",
    "\n",
    "  // create a vector for queue\n",
    "  std::vector<sycl::queue> q;\n",
    "  for (auto &gpu : gpus) {\n",
    "    // create queue for each device and add to vector\n",
    "    q.push_back(sycl::queue(gpu));\n",
    "  }\n",
    "\n",
    "  // Submit kernels to multiple GPUs\n",
    "  if (gpus.size() >= 2){\n",
    "    q[0].parallel_for(N, [=](auto i){\n",
    "      //...\n",
    "    });\n",
    "\n",
    "    q[1].parallel_for(N, [=](auto i){\n",
    "      //...\n",
    "    });\n",
    "  }\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Code Sample: Complex Number Multiplication\n",
    "The following is the definition of a custom class type that represents complex numbers.\n",
    "* The file [Complex.hpp](./src/Complex.hpp) defines the Complex2 class.\n",
    "* The Complex2 class has two member variables, \"real\" and \"imag\", of type int.\n",
    "* The Complex2 class has a member function for performing complex number multiplication: complex_mul multiplies two complex numbers and returns the result as a Complex2 object.\n",
    "* We are going to call the complex_mul function from our SYCL code."
   ]
  },
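  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Independent of SYCL, the multiplication itself follows the usual rule (a + bi)(c + di) = (ac - bd) + (ad + bc)i. The sketch below is a hypothetical stand-in for the Complex2 class (the real definition lives in [Complex.hpp](./src/Complex.hpp) and may differ in detail), shown only to illustrate what complex_mul computes:\n",
    "\n",
    "```cpp\n",
    "#include <iostream>\n",
    "\n",
    "// Hypothetical sketch of a Complex2-like class; the actual class used by the\n",
    "// sample is defined in src/Complex.hpp.\n",
    "class Complex2 {\n",
    " public:\n",
    "  Complex2(int r, int i) : real(r), imag(i) {}\n",
    "  // (a + bi)(c + di) = (ac - bd) + (ad + bc)i\n",
    "  Complex2 complex_mul(const Complex2 &o) const {\n",
    "    return Complex2(real * o.real - imag * o.imag,\n",
    "                    real * o.imag + imag * o.real);\n",
    "  }\n",
    "  int real, imag;\n",
    "};\n",
    "\n",
    "int main() {\n",
    "  Complex2 a(2, 4), b(4, 6);  // first input pair used by the sample below\n",
    "  Complex2 c = a.complex_mul(b);\n",
    "  std::cout << \"(\" << c.real << \" : \" << c.imag << \"i)\\n\";  // prints (-16 : 28i)\n",
    "  return 0;\n",
    "}\n",
    "```"
   ]
  },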
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. Inspect the code cell below, click run ▶ to save the code to file.\n",
    "2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting lab/complex_mult.cpp\n"
     ]
    }
   ],
   "source": [
    "%%writefile lab/complex_mult.cpp\n",
    "//==============================================================\n",
    "// Copyright © Intel Corporation\n",
    "//\n",
    "// SPDX-License-Identifier: MIT\n",
    "// =============================================================\n",
    "#include <sycl/sycl.hpp>\n",
    "#include <iomanip>\n",
    "#include <vector>\n",
    "#include \"Complex.hpp\"\n",
    "\n",
    "using namespace sycl;\n",
    "using namespace std;\n",
    "\n",
    "// Number of complex numbers passed to the SYCL code\n",
    "static const int num_elements = 10000;\n",
    "\n",
    "class CustomDeviceSelector {\n",
    " public:\n",
    "  CustomDeviceSelector(std::string vendorName) : vendorName_(vendorName){};\n",
    "  int operator()(const device &dev) {\n",
    "    int device_rating = 0;\n",
    "    //Give the highest rating (3) to a GPU device whose name contains the\n",
    "    //requested vendor name, the second preference (2) to any other GPU device\n",
    "    //and the third preference (1) to a CPU device.\n",
    "    if (dev.is_gpu() && (dev.get_info<info::device::name>().find(vendorName_) !=\n",
    "                        std::string::npos))\n",
    "      device_rating = 3;\n",
    "    else if (dev.is_gpu())\n",
    "      device_rating = 2;\n",
    "    else if (dev.is_cpu())\n",
    "      device_rating = 1;\n",
    "    return device_rating;\n",
    "  };\n",
    "\n",
    " private:\n",
    "  std::string vendorName_;\n",
    "};\n",
    "\n",
    "// in_vect1 and in_vect2 are the vectors with num_elements complex numbers and\n",
    "// are inputs to the parallel function\n",
    "void SYCLParallel(queue &q, std::vector<Complex2> &in_vect1,\n",
    "                   std::vector<Complex2> &in_vect2,\n",
    "                   std::vector<Complex2> &out_vect) {\n",
    "  auto R = range(in_vect1.size());\n",
    "  if (in_vect2.size() != in_vect1.size() || out_vect.size() != in_vect1.size()){\n",
    "    std::cout << \"ERROR: Vector sizes do not match\" << \"\\n\";\n",
    "    return;\n",
    "  }\n",
    "  // Setup input buffers\n",
    "  buffer bufin_vect1(in_vect1);\n",
    "  buffer bufin_vect2(in_vect2);\n",
    "\n",
    "  // Setup Output buffers \n",
    "  buffer bufout_vect(out_vect);\n",
    "\n",
    "  std::cout << \"Target Device: \"\n",
    "            << q.get_device().get_info<info::device::name>() << \"\\n\";\n",
    "  // Submit Command group function object to the queue\n",
    "  q.submit([&](auto &h) {\n",
    "    // Accessors set as read mode\n",
    "    accessor V1(bufin_vect1,h,read_only);\n",
    "    accessor V2(bufin_vect2,h,read_only);\n",
    "    // Accessor set to Write mode\n",
    "    accessor V3 (bufout_vect,h,write_only);\n",
    "    h.parallel_for(R, [=](auto i) {\n",
    "      V3[i] = V1[i].complex_mul(V2[i]);\n",
    "    });\n",
    "  });\n",
    "  q.wait_and_throw();\n",
    "}\n",
    "void Scalar(std::vector<Complex2> &in_vect1,\n",
    "                 std::vector<Complex2> &in_vect2,\n",
    "                 std::vector<Complex2> &out_vect) {\n",
    "  if ((in_vect2.size() != in_vect1.size()) || (out_vect.size() != in_vect1.size())){\n",
    "    std::cout<<\"ERROR: Vector sizes do not match\"<<\"\\n\";\n",
    "    return;\n",
    "    }\n",
    "  for (int i = 0; i < in_vect1.size(); i++) {\n",
    "    out_vect[i] = in_vect1[i].complex_mul(in_vect2[i]);\n",
    "  }\n",
    "}\n",
    "// Compare the results of the two output vectors from parallel and scalar. They\n",
    "// should be equal\n",
    "int Compare(std::vector<Complex2> &v1, std::vector<Complex2> &v2) {\n",
    "  int ret_code = 1;\n",
    "  if(v1.size() != v2.size()){\n",
    "    ret_code = -1;\n",
    "  }\n",
    "  for (int i = 0; i < v1.size(); i++) {\n",
    "    if (v1[i] != v2[i]) {\n",
    "      ret_code = -1;\n",
    "      break;\n",
    "    }\n",
    "  }\n",
    "  return ret_code;\n",
    "}\n",
    "int main() {\n",
    "  // Declare your Input and Output vectors of the Complex2 class\n",
    "  vector<Complex2> input_vect1;\n",
    "  vector<Complex2> input_vect2;\n",
    "  vector<Complex2> out_vect_parallel;\n",
    "  vector<Complex2> out_vect_scalar;\n",
    "\n",
    "  for (int i = 0; i < num_elements; i++) {\n",
    "    input_vect1.push_back(Complex2(i + 2, i + 4));\n",
    "    input_vect2.push_back(Complex2(i + 4, i + 6));\n",
    "    out_vect_parallel.push_back(Complex2(0, 0));\n",
    "    out_vect_scalar.push_back(Complex2(0, 0));\n",
    "  }\n",
    "\n",
    "  // Create a queue using a custom device selector and offload the computation\n",
    "  try {\n",
    "    // Pass in the name of the vendor for which the device you want to query\n",
    "    std::string vendor_name = \"Intel\";\n",
    "    // std::string vendor_name = \"AMD\";\n",
    "    // std::string vendor_name = \"Nvidia\";\n",
    "    CustomDeviceSelector selector(vendor_name);\n",
    "    queue q(selector);\n",
    "    // Call the SYCLParallel with the required inputs and outputs\n",
    "    SYCLParallel(q, input_vect1, input_vect2, out_vect_parallel);\n",
    "  } catch (...) {\n",
    "    // some other exception detected\n",
    "    std::cout << \"Failure\" << \"\\n\";\n",
    "    std::terminate();\n",
    "  }\n",
    "\n",
    "  std::cout\n",
    "      << \"****************************************Multiplying Complex numbers \"\n",
    "         \"in Parallel********************************************************\"\n",
    "      << \"\\n\";\n",
    "  // Print the outputs of the Parallel function\n",
    "  int indices[]{0, 1, 2, 3, 4, (num_elements - 1)};\n",
    "  constexpr size_t indices_size = sizeof(indices) / sizeof(int);\n",
    "\n",
    "  for (size_t i = 0; i < indices_size; i++) {\n",
    "    int j = indices[i];\n",
    "    if (i == indices_size - 1) std::cout << \"...\\n\";\n",
    "    std::cout << \"[\" << j << \"] \" << input_vect1[j] << \" * \" << input_vect2[j]\n",
    "              << \" = \" << out_vect_parallel[j] << \"\\n\";\n",
    "  }\n",
    "  // Call the Scalar function with the required input and outputs\n",
    "  Scalar(input_vect1, input_vect2, out_vect_scalar);\n",
    "\n",
    "  // Compare the outputs from the parallel and the scalar functions. They should\n",
    "  // be equal\n",
    "\n",
    "  int ret_code = Compare(out_vect_parallel, out_vect_scalar);\n",
    "  if (ret_code == 1) {\n",
    "    std::cout << \"Complex multiplication successfully run on the device\"\n",
    "              << \"\\n\";\n",
    "  } else\n",
    "    std::cout\n",
    "        << \"*********************************************Verification Failed. Results are \"\n",
    "           \"not matched**************************\"\n",
    "        << \"\\n\";\n",
    "\n",
    "  return 0;\n",
    "}\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Build and Run\n",
    "Select the cell below and click run ▶ to compile and execute the code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "## u25cd73e2aa71fc2a1a8182077972ade is compiling SYCL_Essentials Module2 -- SYCL Program Structure sample - 6 of 7 complex_mult.cpp\n",
      "rm -rf bin/complex_mult \n",
      "icpx lab/complex_mult.cpp -fsycl -o bin/complex_mult -Isrc/ -lOpenCL -lsycl\n",
      "bin/complex_mult\n",
      "Target Device: Intel(R) Data Center GPU Max 1100\n",
      "****************************************Multiplying Complex numbers in Parallel********************************************************\n",
      "[0] (2 : 4i) * (4 : 6i) = (-16 : 28i)\n",
      "[1] (3 : 5i) * (5 : 7i) = (-20 : 46i)\n",
      "[2] (4 : 6i) * (6 : 8i) = (-24 : 68i)\n",
      "[3] (5 : 7i) * (7 : 9i) = (-28 : 94i)\n",
      "[4] (6 : 8i) * (8 : 10i) = (-32 : 124i)\n",
      "...\n",
      "[9999] (10001 : 10003i) * (10003 : 10005i) = (-40012 : 200120014i)\n",
      "Complex multiplication successfully run on the device\n"
     ]
    }
   ],
   "source": [
    "! chmod 755 q; chmod 755 run_complex_mult.sh; if [ -x \"$(command -v qsub)\" ]; then ./q run_complex_mult.sh; else ./run_complex_mult.sh; fi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "## Lab Exercise: Vector Add\n",
    "\n",
    "Complete the coding exercise below using SYCL buffer and accessor concepts:\n",
    "- The code has a vector `vector1` initialized on the host\n",
    "- The kernel code increments `vector1` by 1\n",
    "- Create a second vector, `vector2`, and initialize it to the value 20\n",
    "- Create a SYCL buffer for the second vector\n",
    "- In the kernel code, create a second accessor for the second vector's buffer\n",
    "- Modify the vector increment into a vector add by adding `vector2` to `vector1`\n",
    "\n",
    "1. Edit the code cell below by following the steps and then click run ▶ to save the code to a file.\n",
    "2. Next run ▶ the cell in the __Build and Run__ section below the code to compile and execute the code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Overwriting lab/vector_add.cpp\n"
     ]
    }
   ],
   "source": [
    "%%writefile lab/vector_add.cpp\n",
    "//==============================================================\n",
    "// Copyright © Intel Corporation\n",
    "//\n",
    "// SPDX-License-Identifier: MIT\n",
    "// =============================================================\n",
    "#include <sycl/sycl.hpp>\n",
    "\n",
    "using namespace sycl;\n",
    "\n",
    "int main() {\n",
    "    const int N = 256;\n",
    "    //# Initialize a vector and print values\n",
    "    std::vector<int> vector1(N, 10);\n",
    "    std::cout<<\"\\nInput Vector1: \";    \n",
    "    for (int i = 0; i < N; i++) std::cout << vector1[i] << \" \";\n",
    "    //# STEP 1 : Create second vector, initialize to 20 and print values\n",
    "    //# YOUR CODE GOES HERE\n",
    "    std::vector<int> vector2(N, 20);\n",
    "    std::cout<<\"\\nInput Vector2: \";\n",
    "    for (int i = 0; i < N; i++) std::cout << vector2[i] << \" \";\n",
    "    //# Create Buffer\n",
    "    buffer vector1_buffer(vector1);\n",
    "    //# STEP 2 : Create buffer for second vector \n",
    "    //# YOUR CODE GOES HERE\n",
    "    buffer vector2_buffer(vector2);\n",
    "    //# Submit task to add vector\n",
    "    queue q;\n",
    "    q.submit([&](handler &h) {\n",
    "      //# Create accessor for vector1_buffer\n",
    "      accessor vector1_accessor (vector1_buffer,h);\n",
    "      //# STEP 3 - add second accessor for second buffer\n",
    "      //# YOUR CODE GOES HERE\n",
    "      accessor vector2_accessor (vector2_buffer,h);\n",
    "      h.parallel_for(range<1>(N), [=](id<1> index) {\n",
    "        //# STEP 4 : Modify the code below to add the second vector to first one\n",
    "        vector1_accessor[index] += vector2_accessor[index];\n",
    "      });\n",
    "   });\n",
    "  //# Create a host accessor to copy data from device to host\n",
    "  host_accessor h_a(vector1_buffer,read_only);\n",
    "  //# Print Output values \n",
    "  std::cout<<\"\\nOutput Values: \";\n",
    "  for (int i = 0; i < N; i++) std::cout<< vector1[i] << \" \";\n",
    "  std::cout<<\"\\n\";\n",
    "  return 0;\n",
    "}\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Build and Run\n",
    "Select the cell below and click Run ▶ to compile and execute the code above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "## u25cd73e2aa71fc2a1a8182077972ade is compiling SYCL_Essentials Module2 -- SYCL Program Structure sample - 7 of 7 vector_add.cpp\n",
      "\n",
      "Input Vector1: 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 \n",
      "Input Vector2: 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 \n",
      "Output Values: 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 30 \n"
     ]
    }
   ],
   "source": [
    "! chmod 755 q; chmod 755 run_vector_add.sh; if [ -x \"$(command -v qsub)\" ]; then ./q run_vector_add.sh; else ./run_vector_add.sh; fi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***\n",
    "# Summary\n",
    "\n",
    "In this module you learned:\n",
    "* The fundamental SYCL classes\n",
    "* How to select a device to offload kernel workloads to\n",
    "* How to write a SYCL program using buffers, accessors, a command group handler, and a kernel\n",
    "* How to use host accessors and buffer destruction for synchronization\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<html><body><span style=\"color:green\"><h1>Survey</h1></span></body></html>\n",
    "\n",
    "[We would appreciate any feedback you’d care to give, so that we can improve the overall training quality and experience. Thanks! ](https://intel.az1.qualtrics.com/jfe/form/SV_6zljPDDUQ0RBRsx)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {
    "height": "525.6px",
    "left": "28px",
    "top": "137.8px",
    "width": "301.09px"
   },
   "toc_section_display": true,
   "toc_window_display": true
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
